PPC64 ASM: AES-ECB/CBC/CTR/GCM #9852

Open
SparkiDev wants to merge 1 commit into wolfSSL:master from SparkiDev:ppc64_asm_aes

@SparkiDev
Contributor

Description

To turn on assembly:
--enable-ppc64-asm
To build as C code with inline assembly:
--enable-ppc64-asm=inline

To disable hardening (when physical access to device is not possible):
WOLFSSL_PPC64_ASM_AES_NO_HARDEN

AES-GCM works with either 4-bit (default) or table:
--enable-aesgcm=table
Using 'table' is faster for encryption/decryption.

Testing

./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr CFLAGS=-DWOLFSSL_PPC64_ASM_AES_NO_HARDEN --enable-ppc64-asm
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr CFLAGS=-DWOLFSSL_PPC64_ASM_AES_NO_HARDEN --enable-ppc64-asm=inline
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr --enable-ppc64-asm
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr --enable-ppc64-asm=inline
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr

@SparkiDev SparkiDev self-assigned this Mar 3, 2026
@SparkiDev
Contributor Author

PPC64 assembly code generated with PR:
https://github.com/wolfSSL/scripts/pull/556

@SparkiDev SparkiDev force-pushed the ppc64_asm_aes branch 2 times, most recently from fcf8f3e to b606231 Compare March 3, 2026 04:23
@SparkiDev
Contributor Author

retest this please

@dgarske dgarske self-requested a review March 3, 2026 20:16
@dgarske
Contributor

dgarske commented Mar 5, 2026

Initial benchmarks on an NXP T2080 (e6500) core with 1.8GHz core clock:

With PR 9852:

AES-256-GCM-enc          13 MiB took 1.000 seconds, 13.051 MiB/s
AES-256-GCM-dec          13 MiB took 1.001 seconds, 13.044 MiB/s

With master:

AES-256-GCM-enc          15 MiB took 1.000 seconds, 15.305 MiB/s
AES-256-GCM-dec          7 MiB took 1.001 seconds, 7.901 MiB/s

Oh, I did not try with WOLFSSL_PPC64_ASM_AES_NO_HARDEN. I also used the 4-bit table, not --enable-aesgcm=table. Let me run a few more tests.

@dgarske
Contributor

dgarske commented Mar 5, 2026

-O3, AES GCM Table, SHA256 C

Master:

AES-128-CBC-enc          63 MiB took 1.000 seconds, 63.275 MiB/s
AES-128-CBC-dec          65 MiB took 1.000 seconds, 65.966 MiB/s
AES-192-CBC-enc          55 MiB took 1.000 seconds, 55.034 MiB/s
AES-192-CBC-dec          57 MiB took 1.000 seconds, 57.055 MiB/s
AES-256-CBC-enc          48 MiB took 1.000 seconds, 48.796 MiB/s
AES-256-CBC-dec          50 MiB took 1.000 seconds, 50.359 MiB/s
AES-128-GCM-enc          16 MiB took 1.001 seconds, 16.871 MiB/s
AES-128-GCM-dec          8 MiB took 1.000 seconds, 8.488 MiB/s
AES-192-GCM-enc          16 MiB took 1.000 seconds, 16.227 MiB/s
AES-192-GCM-dec          8 MiB took 1.000 seconds, 8.318 MiB/s
AES-256-GCM-enc          15 MiB took 1.000 seconds, 15.618 MiB/s
AES-256-GCM-dec          8 MiB took 1.001 seconds, 8.145 MiB/s
AES-128-GCM-enc-no_AAD   17 MiB took 1.000 seconds, 17.073 MiB/s
AES-128-GCM-dec-no_AAD   8 MiB took 1.000 seconds, 8.537 MiB/s
AES-192-GCM-enc-no_AAD   16 MiB took 1.000 seconds, 16.392 MiB/s
AES-192-GCM-dec-no_AAD   8 MiB took 1.001 seconds, 8.365 MiB/s
AES-256-GCM-enc-no_AAD   15 MiB took 1.000 seconds, 15.786 MiB/s
AES-256-GCM-dec-no_AAD   8 MiB took 1.001 seconds, 8.190 MiB/s
GMAC Table               22 MiB took 1.000 seconds, 22.948 MiB/s
SHA-256                  79 MiB took 1.000 seconds, 79.707 MiB/s
SHA-384                  36 MiB took 1.000 seconds, 36.723 MiB/s
SHA-512                  36 MiB took 1.000 seconds, 36.743 MiB/s
SHA-512/224              36 MiB took 1.000 seconds, 36.761 MiB/s
SHA-512/256              36 MiB took 1.000 seconds, 36.757 MiB/s
HMAC-SHA256              79 MiB took 1.000 seconds, 79.020 MiB/s
HMAC-SHA384              36 MiB took 1.000 seconds, 36.194 MiB/s
HMAC-SHA512              36 MiB took 1.000 seconds, 36.188 MiB/s

PR 9852 with WOLFSSL_PPC64_ASM WOLFSSL_PPC64_ASM_INLINE WOLFSSL_PPC64_ASM_SMALL WOLFSSL_PPC64_ASM_AES_NO_HARDEN WOLFSSL_PPC32_ASM WOLFSSL_PPC32_ASM_INLINE WOLFSSL_PPC32_ASM_SMALL

AES-128-CBC-enc          69 MiB took 1.000 seconds, 69.060 MiB/s
AES-128-CBC-dec          73 MiB took 1.000 seconds, 73.363 MiB/s
AES-192-CBC-enc          59 MiB took 1.000 seconds, 59.358 MiB/s
AES-192-CBC-dec          62 MiB took 1.000 seconds, 62.510 MiB/s
AES-256-CBC-enc          52 MiB took 1.000 seconds, 52.017 MiB/s
AES-256-CBC-dec          54 MiB took 1.000 seconds, 54.347 MiB/s
AES-128-GCM-enc          17 MiB took 1.000 seconds, 17.891 MiB/s
AES-128-GCM-dec          17 MiB took 1.001 seconds, 17.922 MiB/s
AES-192-GCM-enc          17 MiB took 1.001 seconds, 17.143 MiB/s
AES-192-GCM-dec          17 MiB took 1.000 seconds, 17.179 MiB/s
AES-256-GCM-enc          16 MiB took 1.000 seconds, 16.479 MiB/s
AES-256-GCM-dec          16 MiB took 1.000 seconds, 16.512 MiB/s
AES-128-GCM-enc-no_AAD   18 MiB took 1.001 seconds, 18.092 MiB/s
AES-128-GCM-dec-no_AAD   18 MiB took 1.000 seconds, 18.130 MiB/s
AES-192-GCM-enc-no_AAD   17 MiB took 1.001 seconds, 17.334 MiB/s
AES-192-GCM-dec-no_AAD   17 MiB took 1.000 seconds, 17.369 MiB/s
AES-256-GCM-enc-no_AAD   16 MiB took 1.001 seconds, 16.654 MiB/s
AES-256-GCM-dec-no_AAD   16 MiB took 1.000 seconds, 16.687 MiB/s
GMAC Table               24 MiB took 1.000 seconds, 24.648 MiB/s
SHA-256                  67 MiB took 1.000 seconds, 67.083 MiB/s
SHA-384                  36 MiB took 1.000 seconds, 36.714 MiB/s
SHA-512                  36 MiB took 1.000 seconds, 36.720 MiB/s
SHA-512/224              36 MiB took 1.000 seconds, 36.668 MiB/s
SHA-512/256              36 MiB took 1.000 seconds, 36.671 MiB/s
HMAC-SHA256              66 MiB took 1.000 seconds, 66.476 MiB/s
HMAC-SHA384              36 MiB took 1.000 seconds, 36.123 MiB/s
HMAC-SHA512              36 MiB took 1.000 seconds, 36.122 MiB/s

PR 9852 with WOLFSSL_PPC64_ASM WOLFSSL_PPC64_ASM_INLINE WOLFSSL_PPC64_ASM_AES_NO_HARDEN WOLFSSL_PPC32_ASM WOLFSSL_PPC32_ASM_INLINE

AES-128-CBC-enc          69 MiB took 1.000 seconds, 69.025 MiB/s
AES-128-CBC-dec          73 MiB took 1.000 seconds, 73.354 MiB/s
AES-192-CBC-enc          59 MiB took 1.000 seconds, 59.333 MiB/s
AES-192-CBC-dec          62 MiB took 1.000 seconds, 62.503 MiB/s
AES-256-CBC-enc          52 MiB took 1.000 seconds, 52.133 MiB/s
AES-256-CBC-dec          54 MiB took 1.000 seconds, 54.351 MiB/s
AES-128-GCM-enc          17 MiB took 1.000 seconds, 17.882 MiB/s
AES-128-GCM-dec          17 MiB took 1.000 seconds, 17.914 MiB/s
AES-192-GCM-enc          17 MiB took 1.000 seconds, 17.146 MiB/s
AES-192-GCM-dec          17 MiB took 1.000 seconds, 17.175 MiB/s
AES-256-GCM-enc          16 MiB took 1.000 seconds, 16.467 MiB/s
AES-256-GCM-dec          16 MiB took 1.001 seconds, 16.510 MiB/s
AES-128-GCM-enc-no_AAD   18 MiB took 1.001 seconds, 18.094 MiB/s
AES-128-GCM-dec-no_AAD   18 MiB took 1.001 seconds, 18.119 MiB/s
AES-192-GCM-enc-no_AAD   17 MiB took 1.001 seconds, 17.339 MiB/s
AES-192-GCM-dec-no_AAD   17 MiB took 1.001 seconds, 17.363 MiB/s
AES-256-GCM-enc-no_AAD   16 MiB took 1.000 seconds, 16.641 MiB/s
AES-256-GCM-dec-no_AAD   16 MiB took 1.000 seconds, 16.684 MiB/s
GMAC Table               24 MiB took 1.000 seconds, 24.648 MiB/s
SHA-256                  70 MiB took 1.000 seconds, 70.384 MiB/s
SHA-384                  36 MiB took 1.000 seconds, 36.681 MiB/s
SHA-512                  36 MiB took 1.000 seconds, 36.669 MiB/s
SHA-512/224              36 MiB took 1.000 seconds, 36.719 MiB/s
SHA-512/256              36 MiB took 1.000 seconds, 36.720 MiB/s
HMAC-SHA256              69 MiB took 1.000 seconds, 69.747 MiB/s
HMAC-SHA384              36 MiB took 1.000 seconds, 36.155 MiB/s
HMAC-SHA512              36 MiB took 1.000 seconds, 36.159 MiB/s

dgarske
dgarske previously approved these changes Mar 5, 2026
Contributor

@dgarske dgarske left a comment

Benchmarks posted. Marking approved, but won't consider merge until you have a chance to evaluate results. I will also work on running on an e5500 core.

@Scottcjn

Scottcjn commented Mar 9, 2026

Excellent — PPC64 ASM AES is something we have been wanting. We use TLS extensively for our RustChain blockchain attestation nodes and Ergo anchor transactions.

Available for testing:

  • IBM POWER8 S824 — ppc64 (big-endian), 16c/128t, 512GB RAM, GCC 10
  • Power Mac G5 — ppc64 big-endian, Dual 2.0GHz 970

Would be happy to benchmark AES-GCM and AES-CTR throughput on POWER8 before and after this PR. Let us know if test results from real hardware would be useful for review.

@SparkiDev
Contributor Author

Hi David,

Please run the performance numbers with the latest version of the code.

Thanks!
Sean

@SparkiDev
Contributor Author

SparkiDev commented Mar 9, 2026

Hi @Scottcjn,

I have implemented AES-ECB/CBC/CTR/GCM.
If you have time to generate the performance numbers for these modes on any available computers, it would be appreciated.
First though, I need to get the assembly code working on those machines.
Please let me know what compilation errors you see using this code.

Thanks,
Sean

@dgarske dgarske self-requested a review March 9, 2026 02:56
@SparkiDev
Contributor Author

retest this please

@SparkiDev
Contributor Author

retest this please

@dgarske
Contributor

dgarske commented Mar 9, 2026

> Hi David,
>
> Please run the performance numbers with the latest version of the code.
>
> Thanks!
> Sean

I needed this patch:

$ git diff
diff --git a/wolfcrypt/src/aes.c b/wolfcrypt/src/aes.c
index 5d36c2f4d..557b57510 100644
--- a/wolfcrypt/src/aes.c
+++ b/wolfcrypt/src/aes.c
@@ -893,7 +893,8 @@ static WARN_UNUSED_RESULT int wc_AesDecrypt(Aes* aes, const byte* inBlock,
 #elif defined(WOLFSSL_PPC64_ASM)
 
 #if defined(WOLFSSL_AES_DIRECT) || defined(HAVE_AESCCM) || \
-    defined(WOLFSSL_AESGCM_STREAM) || defined(HAVE_AES_ECB)
+    defined(WOLFSSL_AESGCM_STREAM) || defined(HAVE_AES_ECB) || \
+    defined(HAVE_AESGCM)
 static WARN_UNUSED_RESULT int wc_AesEncrypt(Aes* aes, const byte* inBlock,
     byte* outBlock)
 {

Here are the results on an NXP T1040 e5500 at 1.4GHz running Linux

Symmetric Ciphers (MiB/s)

Algorithm          master    pr9852    Delta
AES-128-CBC-enc     47.50     53.92     +14%
AES-128-CBC-dec     46.31     44.63      -4%
AES-192-CBC-enc     41.13     45.75     +11%
AES-192-CBC-dec     41.10     37.75      -8%
AES-256-CBC-enc     36.26     38.40      +6%
AES-256-CBC-dec     36.03     32.57     -10%
AES-128-GCM-enc     23.63     34.68     +47%
AES-128-GCM-dec      9.91     34.94    +252%
AES-192-GCM-enc     21.94     31.35     +43%
AES-192-GCM-dec      9.60     31.34    +227%
AES-256-GCM-enc     20.47     27.82     +36%
AES-256-GCM-dec      9.31     27.88    +200%
GMAC Table          46.19    100.80    +118%

  • AES-GCM decrypt: +200 to +252%
  • AES-GCM encrypt: +36 to +47%
  • GMAC Table: +118% (reaches ~101 MiB/s)
  • AES-CBC encrypt: +6 to +14%
  • AES-CBC decrypt: slight regression of 4-10% -- worth investigating
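
For reference, the Delta column is the relative change from master to the PR build. A minimal sketch of that arithmetic in C follows; the function names are illustrative and are not part of wolfSSL or its benchmark tool. Because the table lists throughput rounded to two decimals, a delta recomputed from the printed values can land a point away from the listed figure.

```c
/* Relative change from `before` to `after`, in percent. */
static double percent_delta(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Round to the nearest whole percent, with halves rounded away from zero. */
static int rounded_delta(double before, double after)
{
    double d = percent_delta(before, after);
    return (int)(d < 0.0 ? d - 0.5 : d + 0.5);
}
```

For example, rounded_delta(47.50, 53.92) gives 14 and rounded_delta(46.31, 44.63) gives -4, matching the first two CBC rows.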

@Scottcjn

Scottcjn commented Mar 9, 2026

POWER8 S824 AES Benchmark Results

Hardware: IBM Power System S824 (8286-42A) — Dual 8-core POWER8, 512GB RAM, Ubuntu 20.04
Build: GCC 9.4.0, -mcpu=power8 -mvsx -maltivec
Config: --enable-aescbc --enable-aesctr --enable-aesgcm --enable-benchmark

Algorithm           Baseline master (MiB/s)   PR #9852 ASM (MiB/s)   Change
AES-128-CBC-enc     196.1                     94.2                   -52%
AES-128-CBC-dec     194.8                     189.7                  -3%
AES-192-CBC-enc     167.9                     77.9                   -54%
AES-192-CBC-dec     165.1                     158.3                  -4%
AES-256-CBC-enc     146.8                     66.3                   -55%
AES-256-CBC-dec     142.5                     134.7                  -5%
AES-128-GCM-enc     99.3                      46.3                   -53%
AES-128-GCM-dec     16.7                      20.6                   +23%
AES-192-GCM-enc     91.6                      24.1                   -74%
AES-192-GCM-dec     16.4                      41.6                   +153%
AES-256-GCM-enc     84.9                      51.4                   -40%
AES-256-GCM-dec     16.2                      53.2                   +228%
GMAC Table 4-bit    204.6                     214.1                  +5%
AES-128-CTR         181.4                     94.2                   -48%
AES-192-CTR         149.2                     78.1                   -48%
AES-256-CTR         131.1                     66.5                   -49%

Observations

  1. Encryption regressions: CBC-enc and CTR regress by roughly 48-55% on POWER8, and GCM-enc by 40-74%
  2. GCM decryption improved: Baseline GCM-dec was oddly slow (~16 MiB/s); the PR fixes this significantly (+23% to +228%)
  3. CBC decryption: Minor regressions of 3-5%

Analysis

This PR uses scalar T-table AES with GPR instructions rather than the hardware vcipher/vcipherlast crypto instructions available since POWER8 (ISA 2.07, 2013). The T-table approach requires expensive cache-line preloading for side-channel mitigation (64 dummy loads per round when hardened), which accounts for much of the overhead.
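
To illustrate what cache-line preloading means here, the following is a rough sketch in C and not the PR's actual assembly. It reads one word from every cache line of a lookup table, so that later key-dependent lookups hit in cache whichever index they use. The function name, the 64-byte line size, and the table size in the usage note are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; the real value depends on the CPU. */
#define CACHE_LINE_BYTES 64u

/*
 * Touch one word in each cache line of `table` (hypothetical helper).
 * Returns the number of dummy loads issued.
 */
static size_t preload_table(const uint32_t *table, size_t words)
{
    /* volatile keeps the compiler from discarding the otherwise unused loads */
    const volatile uint32_t *p = table;
    const size_t step = CACHE_LINE_BYTES / sizeof(uint32_t);
    size_t loads = 0;
    uint32_t sink = 0;

    for (size_t i = 0; i < words; i += step) {
        sink |= p[i];
        loads++;
    }
    (void)sink;
    return loads;
}
```

Under these assumptions, 4 KiB of tables (1024 32-bit words) costs 64 loads per pass. That happens to equal the count quoted above, but the PR's real table layout and line size may differ.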

We verified that vcipher/vcipherlast compile and run correctly on our POWER8 S824 — see PR #9932 for a hardware crypto implementation achieving 3,595 MiB/s on AES-128-CTR (13.7x faster than T-table).

The GCM decryption improvement suggests the baseline C path had a performance issue there that the ASM correctly addresses.

Happy to run additional configurations or ECB benchmarks (--enable-aesecb) if helpful.

Tested on real iron: Elyan Labs POWER8 infrastructure.

@Scottcjn

Scottcjn commented Mar 9, 2026

Update: POWER8 hardware AES implementation submitted

Following up on my benchmark results above — I've submitted PR #9932 which uses POWER8's hardware AES crypto instructions (vcipher/vcipherlast from ISA 2.07) instead of scalar T-table lookups.

Quick comparison (AES-128 on POWER8 S824):

Mode      This PR T-table (MiB/s)   PR #9932 vcipher (MiB/s)   Speedup
CBC-enc   267                       484                        1.8x
CBC-dec   213                       2,796                      13.2x
CTR       262                       3,595                      13.7x
ECB       265                       2,931                      11.0x

The key insight: POWER8 (ISA 2.07, 2013) introduced vcipher/vcipherlast — single-cycle hardware AES round instructions in the vector crypto unit. Using an 8-way interleaved pipeline fills the 7-cycle latency gap, achieving near-theoretical throughput.

The hardware crypto approach is also inherently side-channel resistant (no data-dependent memory accesses), so no cache-line preloading is needed.

I also found a GMAC correctness bug: testwolfcrypt GMAC test fails at line 18271 when PPC64 ASM is enabled on ppc64le (tested with both hardened and unhardened, both GCM table modes).

Happy to collaborate on merging the approaches — your key expansion and GCM GHASH table work could complement the hardware crypto path nicely. @SparkiDev

@Scottcjn

Scottcjn commented Mar 9, 2026

Hi Sean,

Built and benchmarked your latest code on our IBM POWER8 S824 (dual 8-core POWER8, 512 GB RAM, Ubuntu 20.04, GCC 9.4).

Build Notes

Compiled with:

./configure --enable-aescbc --enable-aesgcm --enable-aesctr --enable-ppc64-asm \
  --enable-static --disable-shared \
  CFLAGS="-O2 -mcpu=power8 -fno-pie" LDFLAGS="-no-pie"

Important: The assembly uses R_PPC64_ADDR16_HA absolute relocations for the T-table data, which fails with PIE (Position Independent Executables). Required -fno-pie -no-pie to link. This will be a problem for distributions that enable PIE by default (most modern distros). The assembly would need TOC-relative addressing (@toc@ha/@toc@l) to be PIC-compatible.

Benchmark Results (POWER8 S824, -O2 -mcpu=power8)

AES-128-CBC-enc     95 MiB/s
AES-128-CBC-dec    191 MiB/s
AES-192-CBC-enc     78 MiB/s
AES-192-CBC-dec    159 MiB/s
AES-256-CBC-enc     66 MiB/s
AES-256-CBC-dec    137 MiB/s
AES-128-GCM-enc     70 MiB/s
AES-128-GCM-dec     70 MiB/s
AES-192-GCM-enc     60 MiB/s
AES-192-GCM-dec     60 MiB/s
AES-256-GCM-enc     53 MiB/s
AES-256-GCM-dec     53 MiB/s
AES-128-CTR         94 MiB/s
AES-192-CTR         78 MiB/s
AES-256-CTR         67 MiB/s
GMAC Table 4-bit   264 MiB/s

GMAC Test Failure (Still Present)

testwolfcrypt GMAC test still fails on this revision:

GMAC     test failed!
 error L=18271

Hardware AES Comparison

For reference, here are the numbers from our PR #9932 using vcipher/vcipherlast hardware crypto instructions (ISA 2.07) on the same machine:

Mode              This PR (T-table ASM)   PR #9932 (vcipher HW)   Speedup
AES-128-CBC-enc   95 MiB/s                960 MiB/s               10.1x
AES-128-CBC-dec   191 MiB/s               5,550 MiB/s             29.1x
AES-128-CTR       94 MiB/s                5,217 MiB/s             55.5x
AES-256-CTR       67 MiB/s                3,866 MiB/s             57.7x
AES-128-ECB       -                       5,819 MiB/s             -

The POWER8 ISA 2.07 vcipher/vcipherlast instructions execute AES rounds in the vector crypto unit — single-cycle throughput with 7-cycle latency, which an 8-way interleaved pipeline fills completely. The hardware path also eliminates side-channel risk from T-table lookups.
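
The interleaving idea can be sketched in portable C. This is not the code from PR #9932: toy_round is a placeholder mixing function rather than an AES round, and the lane and round counts are illustrative. The point is the loop order. Issuing the same round for several independent blocks back to back gives the pipeline independent work to overlap while each result is still in flight.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES  8   /* independent blocks kept in flight (illustrative) */
#define ROUNDS 10  /* illustrative round count */

/* Placeholder for one cipher round. This is NOT AES. */
static uint64_t toy_round(uint64_t state, uint64_t round_key)
{
    state ^= round_key;
    state = (state << 13) | (state >> 51);
    return state * 0x9E3779B97F4A7C15ull;
}

/* Serial order: each block finishes every round before the next block starts. */
static void encrypt_serial(uint64_t *blocks, size_t n, const uint64_t *rk)
{
    for (size_t b = 0; b < n; b++)
        for (int r = 0; r < ROUNDS; r++)
            blocks[b] = toy_round(blocks[b], rk[r]);
}

/* Interleaved order: for each round, process LANES independent blocks. */
static void encrypt_interleaved(uint64_t *blocks, size_t n, const uint64_t *rk)
{
    size_t b = 0;

    for (; b + LANES <= n; b += LANES)
        for (int r = 0; r < ROUNDS; r++)
            for (int lane = 0; lane < LANES; lane++)
                blocks[b + lane] = toy_round(blocks[b + lane], rk[r]);

    /* Leftover blocks that do not fill a full group of LANES. */
    encrypt_serial(blocks + b, n - b, rk);
}
```

Both functions produce identical output; only the order of operations changes. This reordering applies to modes whose blocks are independent, such as ECB, CTR, and CBC decryption. CBC encryption chains each block on the previous ciphertext, which may be why it shows the smallest speedup in the table.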

Happy to run any additional tests or configurations. Would be great to see the T-table approach used as a fallback for pre-POWER8 chips (e6500, etc.) with the hardware crypto path for POWER8+.

— Scott

@Scottcjn

Scottcjn commented Mar 9, 2026

Lagniappe: vec_perm AES on Power Mac G4 (no hardware crypto needed)

A little something extra — we ran a pure AltiVec vec_perm AES implementation on a 2002 Power Mac G4 Dual (7450 @ 1.25 GHz, Mac OS X Tiger 10.4, GCC 4.0.1).

This uses vec_perm (vector permute, available since AltiVec 1999) for SubBytes via nibble-indexed S-box table lookup. No vcipher, no PPC64 instructions — runs on any PowerPC with AltiVec.

G4 Results (NIST FIPS-197 test vector verified ✅)

AES-128-ECB (vec_perm 1-way):  4.8 MiB/s
AES-128-ECB (scalar C ref):    3.6 MiB/s    ← 1.33x speedup

POWER8 S824 Results (same code, no vcipher used)

AES-128-ECB (vec_perm 1-way):   58.8 MiB/s
AES-128-ECB (vec_perm 4-way):  312.3 MiB/s  ← 3.3x faster than T-table ASM
AES-128-ECB (scalar C ref):     25.0 MiB/s

Why this matters

The ppc64-aes-asm.S in this PR requires PPC64 (std/ld) and won't assemble on 32-bit PowerPC (G4, e500). The vec_perm approach provides:

  1. Broader hardware support — any AltiVec chip (G4, G5, POWER7, POWER8)
  2. Constant-time execution — no data-dependent memory accesses (no T-table cache timing side channels)
  3. No -fno-pie needed — no absolute relocations, fully PIC-compatible

This is the unoptimized "half-table" method (16 vec_perm passes per SubBytes). The Hamburg algebraic decomposition (GF(2^4) tower field via vec_perm) would reduce this to ~6 vec_perm ops, roughly 2.5x faster.

Technique

SubBytes:   16x vec_perm (nibble-indexed S-box lookup tables)
ShiftRows:  1x vec_perm (byte permutation — one instruction!)
MixColumns: xtime via vec_sra + vec_perm column rotation
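
The data flow of the nibble-indexed lookup and the shift-based xtime can be mirrored in scalar C. This is only a sketch and not the vec_perm_aes.c source: all function names are hypothetical, and the S-box is generated from its FIPS-197 definition to keep the example short. In the vector version the 16-byte rows sit in registers; this scalar emulation still indexes memory, so it shows the logic only and makes no constant-time claim.

```c
#include <stdint.h>

static uint8_t sbox[256];

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Build the S-box: multiplicative inverse, then the affine map. */
static void init_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0; /* the inverse of 0 is defined as 0 */
        for (int y = 1; y < 256 && x != 0; y++) {
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) {
                inv = (uint8_t)y;
                break;
            }
        }
        sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}

/*
 * Treat the S-box as 16 rows of 16 bytes. Every row is looked up with the
 * low nibble (one vec_perm per row in the vector version), and a mask keeps
 * only the row that matches the high nibble.
 */
static uint8_t sub_byte_nibble(uint8_t x)
{
    uint8_t lo = x & 0x0f;
    uint8_t hi = (uint8_t)(x >> 4);
    uint8_t out = 0;

    for (uint8_t row = 0; row < 16; row++) {
        uint8_t candidate = sbox[row * 16 + lo];
        uint8_t mask = (uint8_t)-(hi == row); /* 0xff when this row is selected */
        out |= candidate & mask;
    }
    return out;
}

/* xtime: multiply by 2 in GF(2^8), using a mask built from the top bit. */
static uint8_t xtime(uint8_t x)
{
    uint8_t mask = (uint8_t)-(x >> 7); /* 0xff if the top bit is set, else 0 */
    return (uint8_t)((x << 1) ^ (mask & 0x1b));
}
```

After calling init_sbox(), sub_byte_nibble(x) returns the same value as sbox[x] for every byte x.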

Code (standalone, ~250 lines, compiles with gcc -maltivec -O2)

The full source is at: https://github.com/Scottcjn/wolfssl/blob/power8-hw-aes/wolfcrypt/src/port/ppc64/vec_perm_aes.c

(Will push shortly — wanted to share the results first)

The ideal architecture for wolfSSL PPC AES would be a three-tier dispatch:

  1. POWER8+: vcipher/vcipherlast (5,000+ MiB/s), as in PR #9932
  2. AltiVec (G4/G5/POWER7): vec_perm S-box decomposition (5-312 MiB/s depending on clock)
  3. Scalar fallback: T-table for chips without AltiVec
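
The selection logic for such a dispatch could look like the sketch below. The enum, function name, and capability flags are hypothetical and do not exist in wolfSSL. How the two flags get populated (compile-time macros or a runtime CPU feature query) is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical implementation tiers, from slowest to fastest. */
typedef enum {
    AES_TIER_SCALAR_TTABLE  = 0,
    AES_TIER_ALTIVEC_VPERM  = 1,
    AES_TIER_POWER8_VCIPHER = 2
} aes_tier;

/* Pick the fastest tier that the detected CPU features allow. */
static aes_tier select_aes_tier(bool has_altivec, bool has_vector_crypto)
{
    if (has_vector_crypto)
        return AES_TIER_POWER8_VCIPHER;
    if (has_altivec)
        return AES_TIER_ALTIVEC_VPERM;
    return AES_TIER_SCALAR_TTABLE;
}
```

Checking the most capable tier first keeps the fallback order explicit, so a chip reporting neither feature always lands on the scalar T-table path.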

— Scott

@Scottcjn

Scottcjn commented Mar 9, 2026

Hi Sean,

Thank you — that means a lot coming from the wolfSSL team. I'll reach out to support for the contributor agreement right away.

Happy to continue testing on our POWER8 S824 and vintage PowerPC hardware as needed. We do a lot of work with hardware-level crypto and SIMD optimization at Elyan Labs and would welcome the opportunity to contribute further to wolfSSL's PowerPC support down the road.

Looking forward to getting the CLA sorted and this merged.

Best,
Scott Boudreaux
Elyan Labs

@SparkiDev
Contributor Author

Added XTS.
Improved performance by a little bit.

To turn on assembly:
  --enable-ppc64-asm
To build as C code with inline assembly:
  --enable-ppc64-asm=inline

To disable hardening (when physical access to device is not possible):
  WOLFSSL_PPC64_ASM_AES_NO_HARDEN

AES-GCM works with either 4-bit (default) or table:
  --enable-aesgcm=table
Using 'table' is faster for encryption/decryption.